6. Using Gaussian Processes

  • hyperparameter tuning

  • application to case study

  • interpretability evaluation

Learning outcomes

  1. Describe the theoretical foundation of intrinsically interpretable models like sparse regression, Gaussian processes, and classification and regression trees, and apply them to realistic case studies with appropriate validation checks.

Reading

  • Görtler, J., Kehlbeck, R., & Deussen, O. (2019). A visual exploration of Gaussian processes. Distill, 4(4). doi:10.23915/distill.00017

  • Deisenroth, M., Luo, Y., & van der Wilk, M. (2020). A practical guide to Gaussian processes. https://infallible-thompson-49de36.netlify.app/.

Hyperparameter Optimization

Marginal Likelihood

Q: How do we choose the kernel hyperparameters \(\theta = \left(\ell, \sigma_f, \dots\right)\)?

A: Find \(\theta\) that makes the observed data \(\mathbf{y}\) most probable:

Marginal likelihood

\[p(\mathbf{y} | \mathbf{X}, \theta) = \int p(\mathbf{y} | \mathbf{f}, \mathbf{X}) p(\mathbf{f} | \mathbf{X}, \theta) d\mathbf{f}\]

This “marginalizes” over all possible functions \(\mathbf{f}\) with hyperparameter \(\theta\).

Marginal Likelihood

For the GP model, this has a closed form:

\[\log p(\mathbf{y} | \mathbf{X}, \theta) = -\frac{1}{2}\mathbf{y}^\top(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K} + \sigma_n^2\mathbf{I}| + \text{const}\]

This has two competing terms: the quadratic term rewards data fit, while the log-determinant penalizes model complexity.

Automatic Occam’s Razor

Consider fitting a sine wave with noise. How would you choose the RBF \(\ell\)?

Very small \(\ell\) (overfitting)

  • Will interpolate the noise, not just the curve
  • Spreads probability over too many datasets
  • Low marginal likelihood

Very large \(\ell\) (underfitting)

  • Will oversmooth the sine component
  • Low marginal likelihood

The right choice of \(\ell\) will be flexible enough to fit the sine curve but not the noise. The marginal likelihood finds this automatically without any cross-validation.
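This behavior can be checked numerically. The sketch below (in Python for concreteness, with a toy noisy sine, an RBF kernel, and made-up values for \(\sigma_f\), \(\sigma_n\), and the candidate lengthscales, not the lecture's exact example) evaluates the closed-form log marginal likelihood at a too-small, a moderate, and a too-large \(\ell\):

```python
import numpy as np

def rbf(x1, x2, ell, sigma_f=1.0):
    """Squared-exponential (RBF) kernel matrix."""
    d = x1[:, None] - x2[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / ell) ** 2)

def log_marginal_likelihood(x, y, ell, sigma_n=0.1):
    """Closed-form log p(y | X, theta) for a zero-mean GP."""
    K = rbf(x, x, ell) + sigma_n**2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                      # data fit
            - np.sum(np.log(np.diag(L)))          # complexity penalty
            - 0.5 * len(x) * np.log(2 * np.pi))   # constant

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(50)

for ell in (0.05, 1.0, 20.0):
    print(ell, log_marginal_likelihood(x, y, ell))
```

The moderate lengthscale should score highest: the tiny one spreads probability over too many wiggly datasets, and the huge one cannot explain the sine at all. With cross-validation, the same comparison would have required held-out data.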

Marginal Likelihood Surface

  • Non-convex: multiple local optima
  • Each optimum corresponds to different explanation of data
  • Gradient-based optimization sensitive to initialization

Initializing Signal Variance

Observation = Signal + Noise

\[\begin{align*} y = f(x) + \epsilon \end{align*}\]

Variance Decomposition

\[\begin{align*} \underbrace{\operatorname{Var}[y]}_{\text {what we see }}=\underbrace{\sigma_f^2}_{\text {signal }}+\underbrace{\sigma_n^2}_{\text {noise }} \end{align*}\]

Initializing Signal Variance

Heuristic 1: If the instrument’s precision is known (e.g., Kepler telescope noise \(\sigma_n \approx 10^{-4}\)), then set \[\begin{align*} \sigma_f^2=\operatorname{Var}[y]-\sigma_n^2 \end{align*}\]

Heuristic 2: Suppose the signal dominates by some assumed SNR factor \(\kappa\) (e.g., 2 - 100). Then set \[\begin{align*} \sigma_{f}^{2} \approx \operatorname{Var}[y]\\ \sigma_{n}^{2} \approx \frac{\sigma_f^2}{\kappa^2} \end{align*}\]
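In code, the two heuristics amount to a couple of lines. A Python sketch (the noise level and the SNR factor \(\kappa\) below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.sin(np.linspace(0, 20, 500)) + 0.05 * rng.standard_normal(500)

# Heuristic 1: instrument noise sigma_n is known
sigma_n = 0.05
sigma_f2 = np.var(y) - sigma_n**2          # signal variance = total - noise

# Heuristic 2: assume a signal-to-noise ratio kappa
kappa = 10.0
sigma_f2_b = np.var(y)                     # signal dominates the variance
sigma_n2_b = sigma_f2_b / kappa**2         # noise is kappa^2 times smaller
```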

Initializing Lengthscale

Lengthscale controls how far we travel before \(f(x)\) and \(f(x')\) decorrelate

Heuristic: Set lengthscale relative to spread of input data \[\ell \approx \lambda \cdot \text{SD}[\mathbf{X}], \quad \lambda \in [0.2, 10]\]

Alternative: Use median distance from mean \[\ell \approx \text{median}\{|x_i - \bar{x}|\}\]
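A quick sketch of both lengthscale heuristics (Python; the input distribution and the multiplier \(\lambda\) are made up for illustration):

```python
import numpy as np

# Hypothetical 1D inputs spread over [0, 100]
x = np.sort(np.random.default_rng(2).uniform(0, 100, 200))

# Heuristic: a multiple of the input spread (lambda is a tuning assumption)
lam = 1.0
ell_sd = lam * np.std(x)

# Alternative: median absolute distance from the mean input
ell_med = np.median(np.abs(x - x.mean()))
```

Either value is only a starting point for the marginal-likelihood optimization, not a final estimate.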

Case Study: Stellar Variability

Data Characteristics

> dim(data)
[1] 3822    3

Periodogram

spec <- spectrum(data$flux, pad = 6, plot = FALSE)

There seem to be two periodicities that dominate.
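The same diagnostic can be sketched in Python with scipy.signal.periodogram on a synthetic two-period signal (the periods, amplitudes, and sampling below are invented for illustration, not the Kepler data):

```python
import numpy as np
from scipy.signal import periodogram

# Synthetic flux with two periodicities (periods 25 and 10 time units)
t = np.arange(0, 1000, 1.0)               # one sample per time unit
flux = (np.sin(2 * np.pi * t / 25)
        + 0.5 * np.sin(2 * np.pi * t / 10)
        + 0.1 * np.random.default_rng(3).standard_normal(t.size))

freqs, power = periodogram(flux, fs=1.0)
top = freqs[np.argsort(power)[-2:]]       # two strongest frequencies
print(sorted(1 / top))                    # recovered periods, roughly [10, 25]
```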

Rotation Kernel for Stellar Variability

\[k_{\text{rot}} = k_{\text{periodic}} \times k_{\text{decay}}\]

Physical interpretation:

  • Periodic component: rotation
  • Decay component: timescale for spot changes
  • Product: spots appear/disappear as star rotates

The original source (Angus et al. (2017)) used a slightly more involved periodic kernel, but followed the same domain-informed kernel construction.
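A minimal numerical sketch of the product construction (here using a squared-exponential decay envelope instead of the Matérn-5/2 used in the implementation, with illustrative parameter values):

```python
import numpy as np

def k_rot(tau, period=25.0, ell_per=1.0, ell_decay=75.0, sigma_f=1.0):
    """Quasiperiodic kernel: periodic term times a slowly decaying envelope."""
    k_periodic = np.exp(-2 * np.sin(np.pi * tau / period) ** 2 / ell_per**2)
    k_decay = np.exp(-0.5 * (tau / ell_decay) ** 2)
    return sigma_f**2 * k_periodic * k_decay

tau = np.array([0.0, 25.0, 250.0])
print(k_rot(tau))  # correlation at lag 0, one period, and ten periods
```

After one rotation the correlation is still high (the same spots are back in view); after ten rotations the decay envelope has erased it (the spots have evolved).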

Kernel Specification

This is how we can implement the kernel in R.

k_periodic <- periodic(period = exp(logperiod),   # rotation period
                       lengthscale = 1,
                       variance = exp(logamp))    # signal variance
k_decay <- mat52(lengthscale = exp(logdecay),     # spot-evolution timescale
                 variance = 1)
# product gives quasiperiodicity; white() adds observation noise
full_kernel <- k_periodic * k_decay + white(exp(logs2))

Hyperparameters

  • logperiod: \(\log p\) (rotation period)
  • logdecay: \(\log \ell_{\text{decay}}\) (spot lifetime)
  • logamp: \(\log \sigma_f^2\) (signal variance)
  • logs2: \(\log \sigma_n^2\) (noise variance)

Prior Specification

mean_flux <- normal(0, 1)
logs2 <- normal(-1, 2)
logamp <- normal(log(var(data$flux)), 2)
logperiod <- normal(log(25), 2)
logdecay <- normal(log(25 * 3), 2)

Rationale

  • logperiod ~ N(log(25), 2): Centered at periodogram guess, but wide
  • logdecay ~ N(log(75), 2): Lasts ~3 rotation periods
  • logamp ~ N(log(Var[y]), 2): Signal variance ~ data variance
  • logs2 ~ N(-1, 2): Relatively small noise

Maximum A Posteriori (MAP) Estimate

gp_model <- gp(data$time, full_kernel)
distribution(data$flux) <- normal(gp_model + mean_flux,
                                   data$flux_err)
m <- model(logperiod, logdecay, logamp, logs2, mean_flux)

map_soln <- opt(m, initial_values = start_vals)

This helps us find a mode of the posterior as starting point for MCMC.

Posterior Distribution: Rotation Period

period_samples <- exp(draws[, "logperiod"])

Posterior Predictive Samples

Each draw here is a plausible function \(f(t)\) consistent with data and kernel

time_grid <- seq(min(data$time), max(data$time), length.out = 1000)
gp_array <- project(gp_model, time_grid)
gp_samples <- calculate(gp_array, values = draws)[[1]]

project evaluates the GP posterior distribution at the specified time_grid.

Posterior Predictive Visualization

Observations

  • Periodic structure well-captured
  • Uncertainty narrow near data (compare with later exercise)

Case Study Takeaways

The kernel is a hypothesis about the data generating process.

  • Quasiperiodicity -> rotating star with evolving spots
  • Not an arbitrary computational choice

What we learned

  • Rotation period
  • Spot evolution
  • Subtract fitted stellar variability -> potential exoplanet detection

The GP is not just fitting curves, it’s extracting astrophysics.

Interpretability

Interpretability Criteria

  • Predictive accuracy: Can we approximate complex functions?
  • Parameter meaning: Do parameters correspond to scientific quantities?
  • Uncertainty: Does the model know what it doesn’t know?
  • Visualization: Can you draw what the model believes?
  • Falsifiability: Can you identify when the model is wrong?

GPs do well on all five; deep learning typically fails criteria 2-5.

Comparison with Linear Regression

Linear regression

  • Parametric: \(y = \beta_0 + x^\top \beta\).
  • Interpretable: “\(\beta_{1}\) is the effect of \(x_{1}\) all else held equal”
  • Global: same \(\beta\) everywhere
  • Limited: cannot capture nonlinearity without feature engineering

Gaussian processes

  • Nonparametric: no fixed functional form
  • Interpretable Hyperparameters: \(\ell\) (scale), \(p\) (period), …
  • Local: predictions weighted by kernel
  • Flexible: captures complex patterns

Nonparametric but Interpretable

  • Nonparametric -> complexity grows as the number of samples increases
  • Not “effect sizes” but “characteristic scales”
  • “How quickly does \(f\) vary?” not “How much does \(x\) affect \(y\)?”

Both GPs and linear regression are interpretable, but in different senses.

Limitations and Extensions

Limitations:

  • 1D time series only
  • Observed data are Gaussian
  • Stationary/periodic assumptions

Extensions:

  • Non-Gaussian observations: Maybe we want a Student-t likelihood for robustness or softmax layer for classification.
  • Multidimensional inputs or outputs: We might have other relevant measurements for each star, and may want to model several stars at once.

Exercise

[GP Missing Data] Rerun the case study notebook with a window of observations missing. How does it affect the posterior on the period? What do the posterior predictive samples look like?

data_model <- data |>
  arrange(time) |>
  filter(time < 1210 | time > 1220) |>
  dplyr::slice(seq(1, n(), by = 10))

Angus, Ruth, Timothy Morton, Suzanne Aigrain, Daniel Foreman-Mackey, and Vinesh Rajpaul. 2017. “Inferring Probabilistic Stellar Rotation Periods Using Gaussian Processes.” https://doi.org/10.48550/ARXIV.1706.05459.

“Gaussian Process Models for Stellar Variability — Exoplanet 0.1.3 Documentation.” https://docs.exoplanet.codes/en/v0.1.3/tutorials/stellar-variability/.